Abstract

Hierarchical clustering based on pairwise similarities is a common tool 
used in a broad range of scientific applications. However, in many prob- 
lems it may be expensive to obtain or compute similarities between the 
items to be clustered. This paper investigates the hierarchical clustering 
of $N$ items based on a small subset of pairwise similarities, significantly
fewer than the complete set of $N(N-1)/2$ similarities. First, we show
that if the intracluster similarities exceed intercluster similarities, then it 
is possible to correctly determine the hierarchical clustering from as few 
as $3N \log N$ similarities. We demonstrate that this order of magnitude savings
in the number of pairwise similarities necessitates sequentially selecting 
which similarities to obtain in an adaptive fashion, rather than picking 
them at random. We then propose an active clustering method that is ro- 
bust to a limited fraction of anomalous similarities, and show how even in 
the presence of these noisy similarity values we can resolve the hierarchical 
clustering using only $O(N \log^2 N)$ pairwise similarities.

1 Introduction 

Hierarchical clustering based on pairwise similarities arises routinely in a wide 
variety of engineering and scientific problems. These problems include inferring 
gene behavior from microarray data [1], Internet topology discovery [2], detecting
community structure in social networks [3], advertising [4], and database
management [5]. It is often the case that there is a significant cost associated
with obtaining each similarity value. For example, in the case of Internet
topology inference, the determination of similarity values requires many probe 
packets to be sent through the network, which can place a significant burden on 
the network resources. In other situations, the similarities may be the result of
expensive experiments or require an expert human to perform the comparisons,
again placing a significant cost on their collection. 

The potential cost of obtaining similarities motivates a natural question: Is 
it possible to reliably cluster items using less than the complete, exhaustive set 
of all pairwise similarities? We will show that the answer is yes, particularly un- 
der the condition that intracluster similarity values are greater than intercluster 
similarity values, which we will define as the Tight Clustering (TC) condition. 
We also consider extensions of the proposed approach to more challenging sit- 
uations in which a significant fraction of intracluster similarity values may be 
smaller than intercluster similarity values. This allows for robust, provably- 
correct clustering even when the TC condition does not hold uniformly. 

The TC condition is satisfied in many situations. For example, the TC con- 
dition holds if the similarities are generated by a branching process (or tree 
structure) in which the similarity between items is a monotonic increasing func- 
tion of the distance from the branching root to their nearest common branch 
point (ancestor). This sort of process arises naturally in clustering nodes in the
Internet [7]. Also notice that, for suitably chosen similarity metrics, the data 
can satisfy the TC condition even when the clusters have complex structures. 
For example, if similarities between two points are defined as the length of the 
longest edge on the shortest path between them on a nearest-neighbor graph, 
then they satisfy the TC condition given the clusters do not overlap. Addition- 
ally, density based similarity metrics [5] also allow for arbitrary cluster shapes 
while satisfying the TC condition. 

One natural approach is to try to cluster using a small subset of randomly 
chosen pairwise similarities. However, we show that this is quite ineffective in 
general. We instead propose an active approach that sequentially selects simi- 
larities in an adaptive fashion, and thus we call the procedure active clustering. 
We show that under the TC condition, it is possible to reliably determine the
unambiguous hierarchical clustering of $N$ items using at most $3N \log N$ of the
total of $N(N-1)/2$ possible pairwise similarities. Since it is clear that we
must obtain at least one similarity for each of the N items, this is about as 
good as one could hope to do. Then, to broaden the applicability of the pro- 
posed theory and method, we propose a robust active clustering methodology 
for situations where a random subset of the pairwise similarities are unreliable 
and therefore fail to meet the TC condition. In this case, we show how using 
only $O(N \log^2 N)$ actively chosen pairwise similarities, we can still recover the
underlying hierarchical clustering with high probability. 

Both of the clustering methodologies rely solely on the relative ordering be- 
tween the similarities, which means they are invariant to strictly monotonic 
transformations of the similarities (e.g., scaling or shifts). Therefore, these
techniques will be automatically robust without the need for an additional
preprocessing step for situations where similarity calibration is an issue, such as
subjective human-annotated features arising in applications like the Netflix Problem.

While there have been prior attempts at developing robust procedures for
hierarchical clustering [10, 11, 12], these works do not try to optimize the number
of similarity values needed to robustly identify the true clustering, and mostly
require all $O(N^2)$ similarities. Other prior work has attempted to develop
efficient active clustering methods [13, 14, 15], but the proposed techniques are
ad hoc and do not provide any theoretical guarantees. Outside of the clustering
literature, some interesting connections emerge between this problem
and prior work on graphical model inference [16], which we exploit here.

The paper is organized as follows. The hierarchical clustering problem and
our tight clustering condition are introduced in Section 2. In Section 3 we
describe the proposed methodology for resolving the hierarchical clustering using
a limited number of pairwise values in the noiseless setting. Robust methods
for clustering in the presence of similarity errors and outliers are derived in
Section 4. Finally, experiments on synthetic and real data sets are presented in
Section 5.

2 The Hierarchical Clustering Problem 

Let $X = \{x_1, x_2, \ldots, x_N\}$ be a collection of $N$ items. Our goal will be to resolve
a hierarchical clustering of these items. 

Definition 1. A cluster $C$ is defined as any subset of $X$. A collection of
clusters $T$ is called a hierarchical clustering if $\bigcup_{C_i \in T} C_i = X$ and for any
$C_i, C_j \in T$, only one of the following is true: (i) $C_i \subset C_j$, (ii) $C_j \subset C_i$, (iii)
$C_i \cap C_j = \emptyset$.

The hierarchical clustering $T$ has the form of a tree, where each node
corresponds to a particular cluster. The tree is binary if for every $C_k \in T$ that
is not a leaf of the tree, there exist proper subsets $C_i$ and $C_j$ of $C_k$, such that
$C_i \cap C_j = \emptyset$ and $C_i \cup C_j = C_k$. The binary tree is said to be complete if it has
$N$ leaf nodes, each corresponding to one of the individual items. Without loss
of generality, we will assume that $T$ is a complete (possibly unbalanced) binary
tree, since any non-binary tree can be represented by an equivalent binary tree.

Let $S = \{s_{i,j}\}$ denote the collection of all pairwise similarities between the
items in $X$, with $s_{i,j}$ denoting the similarity between $x_i$ and $x_j$ and assuming
$s_{i,j} = s_{j,i}$. The traditional hierarchical clustering problem uses the complete
set of pairwise similarities to infer T. In order to guarantee that T can be 
correctly identified from S, the similarities must conform to the hierarchy of T. 
We consider the following sufficient condition. 

Definition 2. The triple $(X, T, S)$ satisfies the Tight Clustering (TC) Condition
if for every set of three items $\{x_i, x_j, x_k\}$ such that $x_i, x_j \in C$ and $x_k \notin C$,
for some $C \in T$, the pairwise similarities satisfy $s_{i,j} > \max(s_{i,k}, s_{j,k})$.

In words, the TC condition implies that the similarity between all pairs 
within a cluster is greater than the similarity with respect to any item outside 
the cluster. We can consider using off-the-shelf hierarchical clustering
methodologies, such as bottom-up agglomerative clustering [17], on the set of pairwise
similarities that satisfies the TC condition. Bottom-up agglomerative clustering
is a recursive process that begins with singleton clusters (i.e., the $N$ individual
items to be clustered). At each step of the algorithm, the pair of most similar
clusters is merged. The process is repeated until all items are merged into a
single cluster. It is easy to see that if the TC condition is satisfied, then the 
standard bottom-up agglomerative clustering algorithms such as single linkage, 
average linkage and complete linkage will all produce T given the complete 
similarity matrix S. Various agglomerative clustering algorithms differ in how 
the similarity between two clusters is defined, but every technique requires all 
$N(N-1)/2$ pairwise similarity values since all similarities must be compared
at the very first step. 
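
As an illustration of Definition 2, the following minimal Python sketch checks the TC condition by brute force. The nested-tuple tree encoding, the helper names (`cluster_sets`, `depth_similarity`, `satisfies_tc`), and the choice of similarity (depth of the nearest common ancestor) are assumptions made for this example, not part of the paper.

```python
from itertools import combinations

def cluster_sets(tree):
    """All clusters (frozensets of items) of a nested-tuple binary tree."""
    if not isinstance(tree, tuple):
        return [frozenset([tree])]
    left, right = cluster_sets(tree[0]), cluster_sets(tree[1])
    # The last entry of each child list is that child's own cluster.
    return left + right + [left[-1] | right[-1]]

def depth_similarity(tree):
    """Illustrative similarity: S[a][b] = depth of the nearest common ancestor."""
    S = {}
    def walk(node, depth):
        if not isinstance(node, tuple):
            return [node]
        left, right = walk(node[0], depth + 1), walk(node[1], depth + 1)
        for a in left:
            for b in right:
                S.setdefault(a, {})[b] = depth
                S.setdefault(b, {})[a] = depth
        return left + right
    walk(tree, 0)
    return S

def satisfies_tc(tree, S):
    """Brute-force check of s_ij > max(s_ik, s_jk) for x_i, x_j in C, x_k not in C."""
    all_clusters = cluster_sets(tree)
    items = all_clusters[-1]
    for C in all_clusters:
        for i, j in combinations(C, 2):
            for k in items - C:
                if S[i][j] <= max(S[i][k], S[j][k]):
                    return False
    return True

tree = ((0, 1), (2, 3))
S = depth_similarity(tree)
print(satisfies_tc(tree, S))      # depth-based similarities satisfy TC
S[0][2] = S[2][0] = 5             # one intercluster similarity made too large
print(satisfies_tc(tree, S))      # the condition is now violated
```

The check enumerates every triple, so it is only meant for small examples; it uses all similarities and is not an active method.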

To properly cluster the items using fewer similarities requires a more sophis- 
ticated adaptive approach where similarities are carefully selected in a sequen- 
tial manner. Before contemplating such approaches, we first demonstrate that 
adaptivity is necessary, and that simply picking similarities at random will not 
suffice. 

Proposition 1. Let $T$ be a hierarchical clustering of $N$ items and consider
a cluster of size $m$ in $T$ for some $m \ll N$. If $n$ pairwise similarities, with
$n < \frac{N}{m}(N-1)$, are selected uniformly at random from the pairwise similarity
matrix $S$, then any clustering procedure will fail to recover the cluster with high
probability.

Proof. In order for any procedure to identify the $m$-sized cluster, we need to
measure at least $m - 1$ of the $\binom{m}{2}$ similarities between the cluster items. Let
$p = \binom{m}{2} / \binom{N}{2}$ be the probability that a randomly chosen similarity value will be
between items inside the cluster. If we uniformly sample $n$ similarities, then the
expected number of similarities between items inside the cluster is approximately
$n \binom{m}{2} / \binom{N}{2}$ (for $m \ll N$). Given Hoeffding's inequality, with high probability the
number of observed pairwise similarities inside the cluster will be close to the
expected value. It follows that we require $n \binom{m}{2} / \binom{N}{2} = n \frac{m(m-1)}{N(N-1)} \ge m - 1$,
and therefore we require $n \ge \frac{N}{m}(N-1)$ to reconstruct the cluster with high
probability. □
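
The counting argument in the proof can be checked numerically. The sketch below is only an illustration of the arithmetic; the function names are hypothetical and the example sizes ($N = 512$, $m = 8$) are chosen for convenience.

```python
from math import ceil, comb

def expected_within_cluster(n, N, m):
    """Expected number of n uniformly sampled similarities that fall inside an m-item cluster."""
    return n * comb(m, 2) / comb(N, 2)

def random_samples_needed(N, m):
    """Smallest n whose expected within-cluster count reaches m - 1, i.e. n >= N(N-1)/m."""
    return ceil(N * (N - 1) / m)

N, m = 512, 8
n = random_samples_needed(N, m)
print(n, comb(N, 2))                     # samples needed vs. all pairwise similarities
print(expected_within_cluster(n, N, m))  # equals m - 1 at the threshold
```

For these sizes the threshold is a quarter of all $N(N-1)/2$ similarities, since $\frac{N}{m}(N-1)$ divided by $N(N-1)/2$ equals $2/m$.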

This result shows that if we want to reliably recover clusters of size $m = N^{\alpha}$
(where $\alpha \in [0,1]$), then the number of randomly selected similarities must exceed
$N^{1-\alpha}(N-1)$. In simple terms, randomly chosen similarities will not adequately
sample all clusters. As the cluster size decreases (i.e., as $\alpha \to 0$) this means that
almost all pairwise similarities are needed if chosen at random. This is far more
than are needed if the similarities are selected in a sequential and adaptive
manner. In Section 3 we propose a sequential method that requires at most
$3N \log N$ pairwise similarities to determine the correct hierarchical clustering.

3 Active Hierarchical Clustering under the TC Condition



From Proposition 1 it is clear that unless we acquire almost all of the pairwise
similarities, reconstruction of the clustering hierarchy when sampling at random 
will fail with high probability. In this section, we demonstrate that under the 
assumption that the TC condition holds, an active clustering method based 
on adaptively selected similarities enables one to perform hierarchical clustering 
efficiently. Towards this end, we consider the work in [16], where the authors are
concerned with a very different problem, namely, the identification of causality 
relationships among binary random variables. We present a modified adaptation 
of prior work here in the context of our hierarchical clustering from pairwise 
similarities problem. 

From our discussion in the previous section, it is easy to see that the problem 
of reconstructing the hierarchical clustering $T$ of a given set of items
$X = \{x_1, x_2, \ldots, x_N\}$ can be reinterpreted as the problem of recovering a binary tree
whose leaves are $\{x_1, x_2, \ldots, x_N\}$. In [16], the authors define a special type of
test on triples of leaves called the leadership test, which identifies the "leader" of
the triple in terms of the underlying clustering tree structure. A leaf $x_k$ is said
to be the leader of the triple $(x_i, x_j, x_k)$ if the path from the root of the tree
to $x_k$ does not contain the nearest common ancestor of $x_i$ and $x_j$. An example
of this property can be seen in Figure 1. This prior work shows that one can
efficiently reconstruct the entire tree $T$ using only these leadership tests.

Figure 1: Tree structure where $x_k$ is the leader of the triple $(x_i, x_j, x_k)$.
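
To make the leader definition concrete, here is a small Python sketch that reads the leader of a triple directly off a known tree. The nested-tuple encoding and the names `items_of` and `leader` are assumptions for illustration; the sketch presumes three distinct items that all appear in the tree.

```python
def items_of(tree):
    """Set of items under a nested-tuple binary tree node."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    return items_of(tree[0]) | items_of(tree[1])

def leader(tree, a, b, c):
    """Leader of the triple: the item separated from the other two at their common ancestor."""
    triple = {a, b, c}
    node = tree
    while True:
        in_left = triple & items_of(node[0])
        if len(in_left) == 3:
            node = node[0]          # all three items lie in the left subtree
        elif len(in_left) == 0:
            node = node[1]          # all three items lie in the right subtree
        else:
            # node is the nearest common ancestor of the triple: it splits it 2 versus 1.
            lone = in_left if len(in_left) == 1 else triple - in_left
            return next(iter(lone))

tree = ((("a", "b"), "c"), "d")
print(leader(tree, "a", "b", "c"))   # c
print(leader(tree, "a", "c", "d"))   # d
```

The item that sits alone at the split is the leader, because the nearest common ancestor of the other two lies in the opposite subtree and so is not on the root-to-leader path.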

The following lemma demonstrates that given observed pairwise similarities 
satisfying the TC condition, an outlier test using pairwise similarities will 
correctly resolve the leader of a triple of items. 

Lemma 1. Let $X$ be a collection of items equipped with pairwise similarities $S$
and hierarchical clustering $T$. For any three items $\{x_i, x_j, x_k\}$ from $X$, define

$$\mathrm{outlier}(x_i, x_j, x_k) = \begin{cases} x_i & \text{if } \max(s_{i,j}, s_{i,k}) < s_{j,k} \\ x_j & \text{if } \max(s_{i,j}, s_{j,k}) < s_{i,k} \\ x_k & \text{if } \max(s_{i,k}, s_{j,k}) < s_{i,j} \end{cases} \qquad (1)$$

If $(X, T, S)$ satisfies the TC condition, then outlier($x_i, x_j, x_k$) coincides with
the leader of the same triple with respect to the tree structure conveyed by $T$.

Proof. Suppose that $x_k$ is the leader of the triple with respect to $T$. This occurs
if and only if there is a cluster $C \in T$ such that $x_i, x_j \in C$ and $x_k \notin C$. By
the TC condition, this implies that $s_{i,j} > \max(s_{i,k}, s_{j,k})$. Therefore $x_k$ is the
outlier of the same triple. □

Note that outlier relies only on the ordering of similarity values, and therefore
it is invariant to order-preserving transformations of the similarities. More
precisely, let $f$ be a strictly monotonic (increasing) function and define $f(S) := \{f(s_{i,j})\}$ to be
the set of pairwise similarities under this transformation. If $(X, T, S)$ satisfies
the TC condition, then so does $(X, T, f(S))$. In words, the TC condition does
not require precise calibration of similarity values.
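
The outlier test and its invariance can be illustrated in a few lines of Python. This is a sketch of Equation (1) only; the string labels, the `None` return for ties, and the example transformations are assumptions for illustration.

```python
import math

def outlier(s_ij, s_ik, s_jk):
    """Outlier test of Equation (1); returns "i", "j" or "k" (None if no strict outlier)."""
    if max(s_ij, s_ik) < s_jk:
        return "i"
    if max(s_ij, s_jk) < s_ik:
        return "j"
    if max(s_ik, s_jk) < s_ij:
        return "k"
    return None  # ties; these do not occur under the TC condition

sims = (0.9, 0.2, 0.1)                                   # s_ij is the largest, so x_k is the outlier
increasing = tuple(math.exp(3 * s) + 7 for s in sims)    # strictly increasing map
decreasing = tuple(-s for s in sims)                     # order-reversing map

print(outlier(*sims), outlier(*increasing), outlier(*decreasing))
```

The strictly increasing map leaves the outcome unchanged, while the order-reversing map changes it, which is why the invariance is stated for order-preserving transformations.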

The clustering algorithm we propose is called OUTLIERcluster, and is given
below in Algorithm 1. The procedure is based on the outlier test and a tree
reconstruction algorithm due to [16]. In Theorem 3.1, we show that the algorithm
determines the correct hierarchical clustering using $O(N \log N)$ pairwise
similarities.

Theorem 3.1. Assume that the triple $(X, T, S)$ satisfies the Tight Clustering
(TC) condition where $T$ is a complete (possibly unbalanced) binary tree that is
unknown. Then, OUTLIERcluster recovers $T$ exactly using at most $3N \log_{3/2} N$
adaptively selected pairwise similarity values.

Proof. From Appendix II of [16], we find a methodology that requires at most
$N \log_{3/2} N$ leadership tests to exactly reconstruct the unique hierarchical clustering
of $N$ items. Lemma 1 shows that under the TC condition, each leadership
test can be performed using only 3 adaptively selected pairwise similarities.
Therefore, we can reconstruct the hierarchical clustering $T$ from a set of items
$X$ using at most $3N \log_{3/2} N$ adaptively selected pairwise similarity values. □
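
To get a feel for the bound in Theorem 3.1, the following sketch compares $3N \log_{3/2} N$ with the exhaustive count $N(N-1)/2$. The function names are hypothetical; the numbers are worst-case bounds, not measured query counts.

```python
import math

def total_pairs(N):
    """Number of pairwise similarities among N items."""
    return N * (N - 1) // 2

def outlier_cluster_bound(N):
    """Worst-case bound 3 N log_{3/2} N on adaptively selected similarities."""
    return 3 * N * math.log(N, 1.5)

for N in (8, 128, 512, 4096):
    print(N, total_pairs(N), round(outlier_cluster_bound(N)))
```

For very small $N$ the bound exceeds the exhaustive count (for $N = 8$ it is well above 28), so the guarantee is informative only for larger problems; at $N = 128$ it is about 4,595 against 8,128 pairs, and the gap widens as $N$ grows.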

3.1 Tight Clustering Experiments 

In Table 1 we see the results of both clustering techniques (OUTLIERcluster
and bottom-up agglomerative clustering) on various synthetic tree topologies
given the Tight Clustering (TC) condition. The performance is in terms of
the number of pairwise similarities required by the agglomerative clustering
methodology, denoted by $n_{agg}$, and the number of similarities required by our
OUTLIERcluster method, $n_{outlier}$. The methodologies are performed on both a
balanced binary tree of varying size ($N = 128, 256, 512$) and a synthetic Internet
tree topology generated using the technique from [18]. As seen in the table, our
technique resolves the underlying tree structure using at most 11% of the pairwise
similarities required by the bottom-up agglomerative clustering approach.
As the number of items in the topology increases, further improvements are
seen using OUTLIERcluster. Because the pairwise similarities satisfy the TC
condition, both methodologies resolve a binary representation of the underlying
tree structure exactly.

Algorithm 1 - OUTLIERcluster(X, S)

Given :

1. set of items, $X = \{x_1, x_2, \ldots, x_N\}$.

2. matrix of pairwise similarities, $S$.

Clustering Process :

Initialize clustering tree $T = \{x_1, x_2, \{x_1, x_2\}\}$.
For $i = 3, 4, \ldots, N$

1. Set tree $T_i = T$.

2. While the number of items in $T_i$, denoted $\#T_i$, is greater than 2

(a) Select a subtree, $T_c \subset T_i$, such that the number of items in the
subtree, $\#T_c$, satisfies $\frac{\#T_i}{3} < \#T_c \le \frac{2\,\#T_i}{3}$.

(b) Find items $x_j, x_k \in T_c$ such that no subtrees in $T_c$ (besides $T_c$)
contain both $x_j, x_k$.

(c) If: $x_i$ = outlier($x_i, x_j, x_k$), replace $T_c$ in $T_i$ with item $x_j$,
Else: set $T_i = T_c$.

3. Let $x_j, x_k$ be the two remaining items in $T_i$.

4. Let $T'$ be the smallest subtree in $T$ containing both $x_j, x_k$. This
subtree contains $T'_j, T'_k$, such that $x_j \in T'_j$, $x_k \in T'_k$, $T' = T'_j \cup T'_k$,
and $T'_j \cap T'_k = \emptyset$.

5. If: outlier($x_i, x_j, x_k$) = $x_i$, replace $T'$ in $T$ with $\{T', \{x_i\}\}$.
Elseif: outlier($x_i, x_j, x_k$) = $x_j$, replace $T'_k$ in $T$ with $\{T'_k, \{x_i\}\}$.
Else: outlier($x_i, x_j, x_k$) = $x_k$, replace $T'_j$ in $T$ with $\{T'_j, \{x_i\}\}$.

Output : Hierarchical cluster tree, $T$.
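
The idea of inserting items one at a time using only outlier tests can be sketched in Python as follows. Note that this is a simplified variant, not Algorithm 1 itself: it descends from the root one level at a time instead of using the subtree-selection rule of step 2(a), so it does not carry the $\log_{3/2}$ query bound on unbalanced trees. The nested-tuple encoding and all function names are assumptions for illustration, and the depth-based similarity is just a convenient TC-satisfying test input.

```python
def outlier(S, i, j, k):
    """Outlier test of Equation (1) on a nested-dict similarity lookup S[a][b]."""
    if max(S[i][j], S[i][k]) < S[j][k]:
        return i
    if max(S[i][j], S[j][k]) < S[i][k]:
        return j
    return k

def any_leaf(node):
    """Pick one representative item below a node."""
    while isinstance(node, tuple):
        node = node[0]
    return node

def insert(node, i, S):
    """Insert item i below node using one outlier test (3 similarities) per level."""
    if not isinstance(node, tuple):
        return (node, i)
    left, right = node
    j, k = any_leaf(left), any_leaf(right)
    o = outlier(S, i, j, k)
    if o == i:                       # i lies outside this cluster: attach it above
        return (node, i)
    if o == k:                       # i groups with j: descend into the left subtree
        return (insert(left, i, S), right)
    return (left, insert(right, i, S))

def insertion_cluster(items, S):
    tree = (items[0], items[1])
    for i in items[2:]:
        tree = insert(tree, i, S)
    return tree

def leaf_set(node):
    return frozenset([node]) if not isinstance(node, tuple) else leaf_set(node[0]) | leaf_set(node[1])

def clusters(node):
    if not isinstance(node, tuple):
        return {frozenset([node])}
    return {leaf_set(node)} | clusters(node[0]) | clusters(node[1])

def depth_similarity(tree):
    """Illustrative TC-satisfying input: S[a][b] = depth of the nearest common ancestor."""
    S = {}
    def walk(node, depth):
        if not isinstance(node, tuple):
            return [node]
        left, right = walk(node[0], depth + 1), walk(node[1], depth + 1)
        for a in left:
            for b in right:
                S.setdefault(a, {})[b] = S.setdefault(b, {})[a] = depth
        return left + right
    walk(tree, 0)
    return S

truth = (((0, 1), 2), (3, (4, 5)))
S = depth_similarity(truth)
rebuilt = insertion_cluster([0, 3, 1, 5, 2, 4], S)
print(clusters(rebuilt) == clusters(truth))
```

Under the TC condition each outlier test agrees with the leadership test (Lemma 1), so the rebuilt tree has the same clusters as the true tree regardless of insertion order.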



Table 1: Comparison of OUTLIERcluster and Agglomerative Clustering on various
topologies satisfying the Tight Clustering condition.

Topology            Size      n_agg     n_outlier   n_outlier / n_agg
Balanced Binary     N = 128   8,128     876         10.78%
Balanced Binary     N = 256   32,640    2,206       6.21%
Balanced Binary     N = 512   130,816   4,561       3.49%
Synthetic Internet  N = 768   294,528   8,490       2.88%



3.2 Fragility of OUTLIERcluster 

OUTLIERcluster determines the correct clustering hierarchy when all the pairwise
similarities are consistent with the hierarchy $T$, but it can fail if one or
more of the pairwise similarities are inconsistent. In Figure 2 we examine the
performance of OUTLIERcluster on a balanced binary tree (with $N = 256$)
when a small number of calls to outlier in OUTLIERcluster return an incorrect
item with respect to the underlying hierarchy $T$. The incorrect cases are
selected at random. Performance is measured in terms of $r_{min}$, the size of
the smallest correctly resolved cluster (i.e., all clusters of size $r_{min}$ or larger are
reconstructed correctly), averaged over 150 separate experiments.
As seen in the figure, with only two outlier tests erroneous at random,
the clustering reconstruction using OUTLIERcluster is corrupted
significantly. This can be attributed to the greedy construction of the clustering
hierarchy using this methodology, where if one of the initial items is incorrectly
placed in the hierarchy, this will result in a cascading effect that will drastically
reduce the accuracy of the clustering.




Figure 2: Fragility of OUTLIERcluster when a selected number of outlier tests
are incorrect for a balanced binary tree of size $N = 256$ (horizontal axis: number
of incorrect outlier tests).



4 Robust Active Clustering 

Suppose that most, but not all, of the outlier tests agree with T. This may 
occur if a subset of the similarities are in some sense inconsistent, erroneous 
or anomalous. We will assume that a certain subset of the similarities produce
correct outlier tests and the rest may not. These similarities that produce 
correct tests are said to be consistent with the hierarchy T. Our goal is to 
recover the clusters of T despite the fact that the similarities are not always 
consistent with it. 

Definition 3. The subset of consistent similarities is denoted $S_C \subset S$.
These similarities satisfy the following property: if $s_{i,j}, s_{j,k}, s_{i,k} \in S_C$ then
outlier($x_i, x_j, x_k$) returns the leader of the triple $(x_i, x_j, x_k)$ in $T$ (i.e., the
outlier test is consistent with respect to $T$).

We adopt the following probabilistic model for $S_C$. Each similarity in $S$
fails to be consistent independently with probability at most $q < 1/2$ (i.e.,
membership in $S_C$ is determined by repeatedly tossing a biased coin). The expected
cardinality of $S_C$ is $E[|S_C|] \ge (1 - q)|S|$. Under this model, there is a large
probability that one or more of the outlier tests will yield an incorrect leader with
respect to $T$. Thus, our tree reconstruction algorithm in Section 3 will fail to
recover the tree with large probability. We therefore pursue a different approach
based on a top-down recursive clustering procedure that uses voting to overcome
the effects of incorrect tests.

The key element of the top-down procedure is a robust algorithm for correctly
splitting a given cluster in $T$ into its two subclusters, presented in
Algorithm 2. Roughly speaking, the procedure quantifies how frequently two items
tend to agree on outlier tests drawn from a small random subsample of other
items. If they tend to agree frequently, then they are clustered together; otherwise
they are not. We show that this algorithm can determine the correct split
of the input cluster $C$ with high probability. The degree to which the split is
"balanced" affects performance, and we need the following definition.

Definition 4. Let $C$ be any non-leaf cluster in $T$ and denote its subclusters by
$C_L$ and $C_R$; i.e., $C_L \cap C_R = \emptyset$ and $C_L \cup C_R = C$. The balance factor of $C$ is
$\eta_C := \min\{|C_L|, |C_R|\} / |C|$.

Theorem 4.1. Let $0 < \delta' < 1$ and threshold $\gamma \in (0, 1/2)$. Consider a cluster
$C \in T$ with balance factor $\eta_C \ge \eta$ and disjoint subclusters $C_R$ and $C_L$, and
assume the following conditions hold:

• A1 - The pairwise similarities are consistent with probability at least $1 - q$,
for some $q \le 1 - 1/\sqrt{2}$.

• A2 - $q, \eta$ satisfy $(1 - (1 - q)^2) < \gamma < (1 - q)^2 \eta$.

If $m \ge c_0 \log(4n/\delta')$ and $n > 2m$ (where the constant $c_0$ depends on $q$, $\gamma$, $\eta$
and $\delta'$), then with probability at least $1 - \delta'$ the output of split($C$, $m$, $\gamma$) is the
correct subclusters, $C_R$ and $C_L$.

The proof of the theorem is given in the Appendix. The theorem above
shows that the algorithm is guaranteed (with high probability) to correctly split
clusters that are sufficiently large for a certain range of $q$ and $\eta$, as specified by
A2. A bound on the constant $c_0$ is given in Equation 3 in the proof, but the
important fact is that it does not depend on $n$, the number of items in $C$. Thus
all but the very smallest clusters can be reliably split. Note that the total number of
similarities required by split is at most $3mn$. So if we take $m = c_0 \log(4n/\delta')$,
the total is at most $3 c_0 n \log(4n/\delta')$. The key point of the theorem is this: instead
of using all $O(n^2)$ similarities, split only requires $O(n \log n)$.

The allowable range in A2 is non-degenerate and covers an interesting regime
of problems in which $q$ is not too large and $\eta$ is not too small, as shown in
Figure 3. The allowable range of $\gamma$ cannot be determined without knowledge of
$\eta$ and $q$, so in practice $\gamma \in (0, 1/2)$ is a user-selected parameter (we use $\gamma = 0.30$
in all our experiments in the following section), and the Theorem holds for the
corresponding set of $(q, \eta)$ in A2.

Algorithm 2 - split(C, m, γ)

Input :

1. A single cluster $C$ consisting of $n$ items.

2. Parameters $m < n/2$ and $\gamma \in (0, 1/2)$.

Initialize :

1. Select two subsets $S_V, S_A \subset C$ uniformly at random (with replacement)
containing $m$ items each.

2. Select a "seed" item $x_j \in C$ uniformly at random and let $C_j \in \{C_R, C_L\}$
denote the subcluster it belongs to.

Split :

• For each $x_i \in C$ and $x_k \in S_A \setminus x_i$, compute the outlier fraction of $S_V$:

$$c_{i,k} := \frac{1}{|S_V \setminus \{x_i, x_k\}|} \sum_{x_\ell \in S_V \setminus \{x_i, x_k\}} 1_{\{\mathrm{outlier}(x_i, x_k, x_\ell) = x_\ell\}}$$

where $1$ denotes the indicator function.

• Compute the outlier agreement on $S_A$:

$$a_{i,j} := \sum_{x_k \in S_A \setminus \{x_i, x_j\}} \left( 1_{\{c_{i,k} > \gamma \text{ and } c_{j,k} > \gamma\}} + 1_{\{c_{i,k} < \gamma \text{ and } c_{j,k} < \gamma\}} \right) / |S_A \setminus \{x_i, x_j\}|$$

• Assign item $x_i$ to a subcluster according to

$$x_i \in \begin{cases} C_j & \text{if } a_{i,j} \ge 1/2 \\ C \setminus C_j & \text{if } a_{i,j} < 1/2 \end{cases}$$

Output : subclusters $C_j$, $C \setminus C_j$
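
The voting steps of Algorithm 2 can be sketched in Python as below. This is an illustrative reading of the pseudocode rather than the authors' implementation: the nested-dict similarity lookup `S[a][b]`, the explicit `rng` argument, the fallbacks for empty voting pools, and the noiseless depth-based test similarities are all assumptions made here.

```python
import random

def outlier(S, i, j, k):
    """Outlier test of Equation (1) on a nested-dict similarity lookup."""
    if max(S[i][j], S[i][k]) < S[j][k]:
        return i
    if max(S[i][j], S[j][k]) < S[i][k]:
        return j
    return k

def split(C, m, gamma, S, rng):
    """Sketch of split(C, m, gamma); rng is a random.Random instance."""
    C = list(C)
    voters = [rng.choice(C) for _ in range(m)]   # S_V, sampled with replacement
    agents = [rng.choice(C) for _ in range(m)]   # S_A, sampled with replacement
    seed = rng.choice(C)                         # seed item x_j

    def outlier_fraction(i, k):
        # c_{i,k}: fraction of voters x_l that are the outlier of (x_i, x_k, x_l).
        pool = [l for l in voters if l != i and l != k]
        if not pool:
            return 0.0                           # assumed fallback for an empty pool
        return sum(outlier(S, i, k, l) == l for l in pool) / len(pool)

    seed_fraction = {k: outlier_fraction(seed, k) for k in set(agents) - {seed}}
    with_seed, without_seed = [], []
    for i in C:
        pool = [k for k in agents if k != i and k != seed]
        agree = 0
        for k in pool:
            c_i, c_j = outlier_fraction(i, k), seed_fraction[k]
            if (c_i > gamma and c_j > gamma) or (c_i < gamma and c_j < gamma):
                agree += 1
        a_ij = agree / len(pool) if pool else 1.0  # assumed fallback for an empty pool
        (with_seed if a_ij >= 0.5 else without_seed).append(i)
    return with_seed, without_seed

def balanced_tree(lo, hi):
    """Balanced nested-tuple tree over items lo..hi-1 (test input only)."""
    if hi - lo == 1:
        return lo
    mid = (lo + hi) // 2
    return (balanced_tree(lo, mid), balanced_tree(mid, hi))

def depth_similarity(tree):
    """Noiseless test similarities: depth of the nearest common ancestor."""
    S = {}
    def walk(node, depth):
        if not isinstance(node, tuple):
            return [node]
        left, right = walk(node[0], depth + 1), walk(node[1], depth + 1)
        for a in left:
            for b in right:
                S.setdefault(a, {})[b] = S.setdefault(b, {})[a] = depth
        return left + right
    walk(tree, 0)
    return S

S = depth_similarity(balanced_tree(0, 64))
part_a, part_b = split(range(64), 30, 0.3, S, random.Random(0))
print(sorted(part_a), sorted(part_b))
```

When two items lie in the same subcluster, every voter from the other subcluster is the outlier of their triples, so $c_{i,k}$ tends to exceed $\gamma$; when they lie in different subclusters, a voter is the outlier only if a test is wrong, so $c_{i,k}$ stays small. Each call to `outlier_fraction` touches similarities between an item and the sampled sets only, which is where the $3mn$ count comes from.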



Figure 3: The shaded region depicts the range of $q$ and $\eta$ for which Theorem 4.1
can guarantee correct recovery of subclusters using the split algorithm.

We now give our robust active hierarchical clustering algorithm, RAcluster.
Given an initial single cluster of items, the split methodology of Algorithm 2
is recursively performed until all subclusters are of size less than or equal to
$2m$, the minimum resolvable cluster size where we can overcome inconsistent
similarities through voting. The output is a hierarchical clustering $T'$. The
algorithm is summarized in Algorithm 3.

Algorithm 3 - RAcluster(C, m, γ)

Given :

1. $C$, $n$ items to be hierarchically clustered.

2. Parameters $m < n/2$ and $\gamma \in (0, 1/2)$.

Partitioning :

1. Find $\{C_L, C_R\}$ = split($C$, $m$, $\gamma$).

2. Evaluate hierarchical subtrees, $T_L, T_R$, of cluster $C$ using:

$$T_L = \begin{cases} \text{RAcluster}(C_L, m, \gamma) & \text{if } |C_L| > 2m \\ C_L & \text{otherwise} \end{cases}$$

$$T_R = \begin{cases} \text{RAcluster}(C_R, m, \gamma) & \text{if } |C_R| > 2m \\ C_R & \text{otherwise} \end{cases}$$

Output : Hierarchical clustering $T' = \{T_L, T_R\}$ containing subclusters of size
$> 2m$.

Theorem 4.1 shows that it suffices to use $O(n \log n)$ similarities for each call
of split, where $n$ is the size of the cluster in each call. Now if the splits are
balanced, the depth of the complete cluster tree will be $O(\log N)$, with $O(2^\ell)$
calls to split at level $\ell$ involving clusters of size $n = O(N/2^\ell)$. An easy
calculation then shows that the total number of similarities required by RAcluster
is $O(N \log^2 N)$, compared to the total number which is $O(N^2)$. The performance
guarantees for the robust active clustering algorithm are summarized
in the following main theorem.

Theorem 4.2. Let $X$ be a collection of $N$ items with underlying hierarchical
clustering structure $T$ and let $0 < \delta < 1$. If $m \ge k_0 \log(\frac{8}{\delta} N)$, for a constant
$k_0 > 0$, then RAcluster($X$, $m$, $\gamma$) uses $O(N \log^2 N)$ similarities and with probability
at least $1 - \delta$ recovers all clusters $C \in T$ that have size $> 2m$, with
balance factor rjc > rj, and satisfy Al holding with 5' = — , , , , 1 7 and A 2 of 

Theorem 4.1.

The proof of the theorem is given in the Appendix. The constant $k_0$ is
specified in Equation 3. Roughly speaking, the theorem implies that under
the conditions of Theorem 4.1 we can robustly recover all clusters of size
$O(\log N)$ or larger using only $O(N \log^2 N)$ similarities. Comparing this result
to Theorem 3.1, we note three costs associated with being robust to inconsistent
similarities: 1) we require $O(N \log^2 N)$ rather than $O(N \log N)$ similarity values;
2) the degree to which the clusters are balanced now plays a role (in the constant
$\eta$); 3) we cannot guarantee the recovery of clusters smaller than $O(\log N)$ due
to voting.
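
The recursion of RAcluster can be sketched independently of how the split is computed. In the Python sketch below the split is passed in as a function; `halve` is a hypothetical stand-in that simply cuts a sorted list in the middle, used only to exercise the recursion, and the guard against an empty side is a safeguard assumed here rather than part of the algorithm as stated.

```python
def ra_cluster(C, m, split_fn):
    """Recursive skeleton of RAcluster; split_fn(C) returns two lists of items."""
    left, right = split_fn(C)
    if not left or not right:
        # Degenerate split: stop instead of recursing forever (assumed safeguard).
        return frozenset(C)
    t_left = ra_cluster(left, m, split_fn) if len(left) > 2 * m else frozenset(left)
    t_right = ra_cluster(right, m, split_fn) if len(right) > 2 * m else frozenset(right)
    return (t_left, t_right)

def halve(C):
    """Hypothetical stand-in for split: cut the sorted items in the middle."""
    items = sorted(C)
    return items[: len(items) // 2], items[len(items) // 2 :]

def leaf_clusters(tree):
    """Unresolved clusters (of size at most 2m) at the bottom of the hierarchy."""
    if isinstance(tree, frozenset):
        return [tree]
    return leaf_clusters(tree[0]) + leaf_clusters(tree[1])

tree = ra_cluster(range(64), 4, halve)
print([sorted(c) for c in leaf_clusters(tree)])
```

With $N = 64$ and $m = 4$ the recursion stops once clusters reach size $2m = 8$, which matches the statement that clusters smaller than the voting budget allows are left unresolved.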

5 Robust Clustering Experiments 

To test our robust clustering methodology we focus on experimental results
from a balanced binary tree using synthesized similarities and a real-world data
set using genetic microarray data ([19]). The synthetic binary tree experiments
allow us to observe the characteristics of our algorithm while controlling the
amount of inconsistency with respect to the Tight Clustering (TC) condition,
while the real-world data gives us perspective on problems where the tree structure
and TC condition are assumed, but not known.

In order to quantify the performance of the tree reconstruction algorithms,
consider the non-unique partial ordering, $\pi : \{1, 2, \ldots, N\} \to \{1, 2, \ldots, N\}$, resulting
from the ordering of items in the reconstructed tree. For a set of observed
similarities, given the original ordering of the items from the true tree structure
we would expect to find the largest similarity values clustered around the diagonal
of the similarity matrix. Meanwhile, a random ordering of the items would
have the large similarity values potentially scattered away from the diagonal.
To assess performance of our reconstructed tree structures, we will consider
the rate of decay for similarity values off the diagonal of the reordered items,
$\hat{s}_d = \frac{1}{N-d} \sum_{i=1}^{N-d} s_{\pi(i),\pi(i+d)}$. Using $\hat{s}_d$, we define a distribution over the
average off-diagonal similarity values, and compute the entropy of this distribution
as follows:

$$E(\pi) = -\sum_{i=1}^{N-1} p_i^{\pi} \log p_i^{\pi}$$

where $p_i^{\pi} = \left( \sum_{d=1}^{N-1} \hat{s}_d \right)^{-1} \hat{s}_i$.

This entropy value provides a measure of the quality of a partial ordering
induced by the tree reconstruction algorithm. For a balanced binary tree
with $N = 512$, we find that for the original ordering, $E(\pi_{original}) = 2.2323$,
and for the random ordering, $E(\pi_{random}) = 2.702$. This motivates examining
the estimated Δ-entropy of our clustering reconstruction-based orderings as
$E_\Delta(\pi) = E(\pi_{random}) - E(\pi)$, where we normalize the reconstructed clustering
entropy value with respect to a random permutation of the items. The quality of
our clustering methodologies will be examined, where the larger the estimated
Δ-entropy, the higher the quality of our estimated clustering.
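
The ordering-quality measure can be sketched in Python as follows. The function names, the list-of-lists similarity matrix, the requirement of non-negative similarities with a positive sum, and the base-10 logarithm are assumptions made for this illustration.

```python
import math

def off_diagonal_means(S, order):
    """Average similarity at each offset d = 1..N-1 of the reordered matrix."""
    N = len(order)
    return [
        sum(S[order[i]][order[i + d]] for i in range(N - d)) / (N - d)
        for d in range(1, N)
    ]

def ordering_entropy(S, order):
    """Entropy of the normalized off-diagonal averages (base-10 log assumed)."""
    means = off_diagonal_means(S, order)
    total = sum(means)
    probs = [v / total for v in means]
    return -sum(p * math.log10(p) for p in probs if p > 0)

def delta_entropy(S, order, reference_order):
    """Entropy of a reference (e.g. random) ordering minus that of the given ordering."""
    return ordering_entropy(S, reference_order) - ordering_entropy(S, order)

# Three items on a chain: neighbours are similar, the two end items are not.
S = [[0, 2, 0],
     [2, 0, 2],
     [0, 2, 0]]
print(ordering_entropy(S, [0, 1, 2]))       # all mass at offset 1
print(ordering_entropy(S, [0, 2, 1]))       # mass spread over both offsets
print(delta_entropy(S, [0, 1, 2], [0, 2, 1]))
```

An ordering that keeps similar items adjacent concentrates the averages at small offsets and therefore has low entropy, so a larger difference from the reference ordering indicates a better reconstruction.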

For the synthetic binary tree experiments, we created a balanced binary tree
with 512 items. We generated similarity between each pair of items such that
$100 \cdot (1 - q)\%$ of the pairwise similarities chosen at random are consistent with
the TC condition ($\in S_C$). The remaining $100 \cdot q\%$ of the pairwise similarities
were inconsistent with the TC condition. We examined the performance of
both standard bottom-up agglomerative clustering and our Robust Clustering
algorithm, RAcluster, for pairwise similarities with $q = 0.05, 0.15, 0.25$. The
results presented here are averaged over 10 random realizations of noisy synthetic
data, with the threshold set to $\gamma = 0.30$. We used the similarity voting budgets
$m = 40$ and $m = 80$, which require 38% and 65% of the complete set of similarities,
respectively. Performance gains are shown using our robust clustering
approach in Table 2 in terms of both the estimated Δ-entropy and $r_{min}$, the
size of the smallest correctly resolved cluster (where all clusters of size $r_{min}$
or larger are reconstructed correctly). Comparisons between
Δ-entropy and $r_{min}$ show a clear correlation between high Δ-entropy and high
clustering reconstruction resolution. While we make no theoretical guarantees
for reconstruction when the clusters are smaller than $m$ or when the properties
of the clusters $(q, \eta)$ are outside the feasible region of Figure 3, these results
show that these clusters can possibly be resolved in practice.



Table 2: Clustering Δ-entropy results for synthetic binary tree with $N = 512$
for Agglomerative Clustering and RAcluster.

        Agglo. Clustering      Robust (m=40)          Robust (m=80)
q       Δ-Entropy   r_min      Δ-Entropy   r_min      Δ-Entropy   r_min
0.05    0.3666      460.8      1.0178      6.8        1.0178      7.2
0.15    0.0899      512        1.0161      16         1.0161      15.2
0.25    0.0133      512        0.9360      384        1.0119      57.6



For a real-world data set, we test our methodologies against a set of
gene microarray data. Our genetic dataset [19] consists of 1,024 yeast genes with
7 expressions each, from which we exhaustively generate the standard Pearson
correlation using the expression vectors for every pair of genes. Our robust
clustering methodology is performed on the datasets using the threshold $\gamma = 0.30$
and similarity voting budgets m = 10 and m = 80, requiring at most 18% and
61% of the total similarities (when N = 512), respectively. The results in Table 3
(averaged over 10 random permutations of the datasets) show that our
robust clustering methodology again outperforms agglomerative clustering in terms
of estimated $\Delta$-entropy of the reordered elements, without needing to
observe every pairwise similarity value. Figure 4 shows the reordered
similarity matrices given both agglomerative clustering and our robust clustering
methodology, RAcluster.
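
The exhaustive similarity computation described above can be sketched roughly as follows. This is a minimal illustration only: the random `expr` array is a hypothetical stand-in for the yeast data of [19] (and is much smaller), and the helper names `pearson` and `all_pairwise_similarities` are assumptions made for this sketch, not part of the original experiments.

```python
import math
import random

def pearson(u, v):
    """Pearson correlation between two equal-length expression vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def all_pairwise_similarities(expr):
    """Exhaustively compute the similarity for every unordered pair of rows."""
    n = len(expr)
    return {(i, j): pearson(expr[i], expr[j])
            for i in range(n) for j in range(i + 1, n)}

# Hypothetical stand-in data: 64 "genes" with 7 expression values each.
random.seed(0)
expr = [[random.gauss(0.0, 1.0) for _ in range(7)] for _ in range(64)]
sims = all_pairwise_similarities(expr)
```

The dictionary holds N(N-1)/2 entries, which is the complete set that agglomerative clustering consumes and that the robust methodology only samples from.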

Figure 4: Reordered pairwise similarity matrices for the gene microarray data with
N = 1024 using (A) Agglomerative Clustering and (B) Robust Clustering
using m = 80 (requiring only 43% of the similarities). An ideal clustering
would organize items so that the similarity matrix has dark blue (high similarity)
clusters/blocks on the diagonal and light blue (low similarity) values off the
diagonal blocks. The robust clustering is clearly closer to this ideal (i.e., B
compared to A).

Table 3: $\Delta$-entropy results for real world gene microarray dataset using both
Agglomerative Clustering and Robust Clustering algorithms.

Dataset          Agglo.    Robust (m=10)    Robust (m=80)
Gene (N=512)     0.1417    0.1912           0.2035
Gene (N=1024)    0.0761    0.1325           0.1703


6 Conclusions 

Despite the wide-ranging applications of hierarchical clustering (biology,
networking, scientific simulation), relatively little work has been done on examining
the number of pairwise similarity values needed to resolve the hierarchical
dependencies in the presence of noise. The goal of our work was to adaptively
select pairwise similarity values so as to drastically reduce the total number
needed to resolve the hierarchical dependency structure while
remaining robust to potential outliers in the data. When there are no outliers,
we presented a methodology that requires no more than $3N \log N$ similarity
values to recover the clustering hierarchy. We then showed that in the presence
of inconsistent similarity values we only require on the order of $O(N \log^2 N)$
similarities to robustly recover the clustering. These results open hierarchical
clustering to a new realm of large-scale problems that were previously
impractical to evaluate.






7 Appendix 



7.1 Proof of Theorem 4.1

Since the outlier tests can be erroneous, we instead use a two-round voting
procedure to correctly determine whether two items $x_i$ and $x_j$ are in the same
subcluster or not. Please refer to Algorithm 1 for definitions of the relevant
quantities. The following lemma establishes that the outlier fraction values $c_{i,k}$
can reveal whether two items $x_i$, $x_k$ are in the same subcluster or not, provided
that the number of voting items $m = |S_V|$ is large enough and the similarity
$s_{i,k}$ is consistent (later we will show that a second round of voting can remove
this requirement).

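Lemma 2. Consider two items $x_i$ and $x_k$. Under assumptions A1 and A2 and
assuming $s_{i,k} \in S_C$, comparing the outlier count values $c_{i,k}$ to a threshold $\gamma$ will
correctly indicate whether $x_i$, $x_k$ are in the same subcluster with probability at
least $1 - \delta_C/2$ for

$$ m \ge \frac{\log(4/\delta_C)}{2 \min\left( (\gamma - 1 + (1-q)^2)^2,\; ((1-q)^2 \eta - \gamma)^2 \right)}. $$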

Proof. Let $\Omega_{i,k} := \mathbf{1}_{\{s_{i,k} \in S_C\}}$ be the event that the similarity between items
$x_i$, $x_k$ is in the consistent subset (see Definition 3). Under A1, the expected
outlier fraction ($c_{i,k}$) conditioned on $x_i$, $x_k$ and $\Omega_{i,k}$ can be bounded in two
cases, when they belong to the same subcluster and when they do not:

$$ E\left[ c_{i,k} \mid x_i, x_k \in C_L \text{ or } x_i, x_k \in C_R,\; \Omega_{i,k} \right] \ge (1-q)^2 \eta $$

$$ E\left[ c_{i,k} \mid x_i \in C_R, x_k \in C_L \text{ or } x_i \in C_L, x_k \in C_R,\; \Omega_{i,k} \right] \le 1 - (1-q)^2 $$

A2 stipulates a gap between the two bounds. Hoeffding's Inequality ensures
that, with high probability, $c_{i,k}$ will not significantly deviate below/above the
lower/upper bound. Thresholding $c_{i,k}$ at a level $\gamma$ between the bounds will,
with high probability, correctly determine whether $x_i$ and $x_k$ are in the same
subcluster or not. More precisely, if

$$ m \ge \frac{\log(4/\delta_C)}{2 \min\left( (\gamma - 1 + (1-q)^2)^2,\; ((1-q)^2 \eta - \gamma)^2 \right)}, $$

then with probability at least $(1 - \delta_C/2)$ the threshold test correctly determines
if the items are in the same subcluster.

□ 
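
To get a feel for the sample-size condition of Lemma 2, the bound can be evaluated numerically. The sketch below is only an illustration: the function name `lemma2_min_votes` and the example parameter values are assumptions, and the natural logarithm is assumed (as is usual for Hoeffding-type bounds).

```python
import math

def lemma2_min_votes(q, eta, gamma, delta_c):
    """Smallest integer m satisfying the Lemma 2 sample-size condition."""
    upper = 1.0 - (1.0 - q) ** 2   # bound on E[c] for different subclusters
    lower = (1.0 - q) ** 2 * eta   # bound on E[c] for the same subcluster
    if not upper < gamma < lower:
        raise ValueError("gamma must lie strictly between the two bounds")
    # Squared distance from the threshold to the nearer of the two bounds.
    gap_sq = min((gamma - upper) ** 2, (lower - gamma) ** 2)
    return math.ceil(math.log(4.0 / delta_c) / (2.0 * gap_sq))

# Hypothetical parameter values chosen only for illustration.
m_needed = lemma2_min_votes(q=0.05, eta=0.5, gamma=0.30, delta_c=0.05)
```

When q grows, the two bounds approach each other and eventually cross, at which point no threshold $\gamma$ separates the two cases and the function raises an error.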

Next, note that we cannot use the cluster count $c_{i,j}$ directly to decide the
placement of $x_i$ since the condition $s_{i,j} \in S_C$ may not hold. In order to be robust
to errors in $s_{i,j}$, we employ a second round of voting based on an independent
set of m randomly selected agreement items, $S_A$. The agreement fraction, $a_{i,j}$,
is the average of the number of times the item $x_i$ agrees with the clustering
decision of $x_j$ on $S_A$.

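Lemma 3. Consider the following procedure:

$$ x_i \in \begin{cases} C_j & \text{if } a_{i,j} \ge 1/2 \\ C_j^c & \text{if } a_{i,j} < 1/2 \end{cases} $$

Under assumptions A1 and A2, with probability at least $1 - \delta_C/2$ the above
procedure based on $m = |S_A|$ agreement items will correctly determine if the
items $x_i$, $x_j$ are in the same subcluster, provided

$$ m \ge \frac{\log(4/\delta_C)}{2 \left( (1-\delta_C)(1-q)^2 - 1/2 \right)^2}. $$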

Proof. Define $\Phi^k_{i,j}$ as the event that similarities $s_{i,k}$, $s_{j,k}$ are both consistent (i.e.,
$s_{i,k}, s_{j,k} \in S_C$) and thresholding the cluster counts $c_{i,k}$, $c_{j,k}$ at level $\gamma$ correctly
indicates if the underlying items belong to the same subcluster or not. Then
using Lemma 2 and the union bound we can bound the probabilities,

$$ P\left( \Phi^k_{i,j} \right) \ge (1-\delta_C)(1-q)^2 $$

$$ P\left( (\Phi^k_{i,j})^c \right) \le 1 - (1-\delta_C)(1-q)^2 $$

Then the conditional expectations of the agreement counts, $a_{i,j}$, can be bounded
as,

$$ E\left[ a_{i,j} \mid x_i \notin C_j \right] \le P\left( (\Phi^k_{i,j})^c \right) \le 1 - (1-q)^2 (1-\delta_C) $$

$$ E\left[ a_{i,j} \mid x_i \in C_j \right] \ge P\left( \Phi^k_{i,j} \right) \ge (1-q)^2 (1-\delta_C) $$

Since $q < 1 - 1/\sqrt{2(1-\delta')}$ and $\delta_C = \delta'/n$ (as defined below), there is a gap
between these two bounds that includes the value 1/2. Hoeffding's Inequality
ensures that with high probability $a_{i,j}$ will not significantly deviate above/below
the upper/lower bound. Thus, thresholding $a_{i,j}$ at 1/2 will resolve whether the
two items $x_i$, $x_j$ are in the same or different subclusters with probability at least
$(1 - \delta_C/2)$ provided $m \ge \log(4/\delta_C) / \left( 2 \left( (1-\delta_C)(1-q)^2 - 1/2 \right)^2 \right)$. □
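
The second voting round of Lemma 3 amounts to thresholding an average at 1/2. The sketch below illustrates the decision rule and the corresponding sample-size condition; the names `same_subcluster` and `lemma3_min_votes`, the 0/1 encoding of agreements, and the example values are assumptions made for illustration, and the natural logarithm is assumed.

```python
import math

def same_subcluster(agreements):
    """Second-round decision: threshold the agreement fraction a_ij at 1/2.

    `agreements` is a list of 0/1 flags, one per agreement item in S_A,
    equal to 1 when x_i agrees with the clustering decision of x_j.
    """
    a_ij = sum(agreements) / len(agreements)
    return a_ij >= 0.5

def lemma3_min_votes(q, delta_c):
    """Smallest integer m satisfying the Lemma 3 sample-size condition."""
    gap = (1.0 - delta_c) * (1.0 - q) ** 2 - 0.5
    if gap <= 0:
        raise ValueError("no gap around 1/2: q or delta_c is too large")
    return math.ceil(math.log(4.0 / delta_c) / (2.0 * gap ** 2))

# Hypothetical parameter values chosen only for illustration.
m_agree = lemma3_min_votes(q=0.05, delta_c=0.05)
```

The error raised for a non-positive gap corresponds to the requirement on q stated in the proof: without it, the two conditional expectations are not separated by 1/2.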



By combining Lemmas 2 and 3 we can state the following. The split
methodology of Algorithm 1 will successfully determine if two items $x_i$, $x_j$ are
in the same subcluster with probability at least $1 - \delta_C$ under assumptions A1
and A2, provided

$$ m \ge \max\left\{ \frac{\log(4/\delta_C)}{2 \left( (1-\delta_C)(1-q)^2 - 1/2 \right)^2},\; \frac{\log(4/\delta_C)}{2 \min\left( (\gamma - 1 + (1-q)^2)^2,\; ((1-q)^2 \eta - \gamma)^2 \right)} \right\} $$

and the cluster under consideration has at least 2m items.

Successfully determining the subcluster assignments for all n items
of the cluster C with probability at least $1 - \delta'$ requires setting $\delta_C = \delta'/n$ (i.e.,
taking the union bound over all n items). Thus we have the requirement

$$ m \ge c_0(\delta', \eta, q, \gamma) \log(4n/\delta') $$

where the constant obeys

$$ c_0(\delta', \eta, q, \gamma) \ge \max\left\{ \frac{1}{2 \left( (1-\delta'/n)(1-q)^2 - 1/2 \right)^2},\; \frac{1}{2 \min\left( (\gamma - 1 + (1-q)^2)^2,\; ((1-q)^2 \eta - \gamma)^2 \right)} \right\} \qquad (3) $$

Finally, this result and assumptions A1-A2 imply that the algorithm split(C, m, $\gamma$)
correctly determines the two subclusters of C with probability at least $1 - \delta'$.
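
The combined requirement on m can be evaluated by substituting $\delta_C = \delta'/n$ into the two lemma conditions. The sketch below does exactly that; the function name `split_min_votes` and the example values are hypothetical, and the natural logarithm is assumed.

```python
import math

def split_min_votes(n, q, eta, gamma, delta_prime):
    """Votes m sufficient for split on a cluster of n items (both rounds)."""
    delta_c = delta_prime / n  # union bound over the n items of the cluster
    # Gap around 1/2 used by the agreement round (Lemma 3).
    agree_gap = (1.0 - delta_c) * (1.0 - q) ** 2 - 0.5
    # Distance from gamma to the nearer outlier-fraction bound (Lemma 2).
    vote_gap = min(gamma - 1.0 + (1.0 - q) ** 2, (1.0 - q) ** 2 * eta - gamma)
    if agree_gap <= 0 or vote_gap <= 0:
        raise ValueError("parameters fall outside the feasible region")
    # Both gaps are positive here, so squaring the smaller gap equals the
    # minimum of the squared gaps used in the lemma statements.
    c0 = max(1.0 / (2.0 * agree_gap ** 2), 1.0 / (2.0 * vote_gap ** 2))
    return math.ceil(c0 * math.log(4.0 * n / delta_prime))

# Hypothetical parameter values chosen only for illustration.
m_small = split_min_votes(n=64, q=0.05, eta=0.5, gamma=0.30, delta_prime=0.05)
m_large = split_min_votes(n=512, q=0.05, eta=0.5, gamma=0.30, delta_prime=0.05)
```

The required m grows only logarithmically with the cluster size n, which is what later allows the total number of similarities to scale as $N \log^2 N$.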



7.2 Proof of Theorem 4.2

Lemma 4. A binary tree with N leaves and balance factor $\eta_C \ge \eta$ has depth of
at most $L \le \log N / \log\left(\frac{1}{1-\eta}\right)$.

Proof. Consider a binary tree structure with N leaves (items) with balance
factor $\eta \le 1/2$. After depth of $\ell$, the number of items in the largest cluster is
bounded by $(1-\eta)^{\ell} N$. If L denotes the maximum depth level, then there can
only be 1 item in the largest cluster after depth of L, so we have $1 \le (1-\eta)^{L} N$.
Taking logarithms gives $L \le \log N / \log\left(\frac{1}{1-\eta}\right)$. □
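
The depth bound of Lemma 4 can be checked numerically against the most unbalanced tree the balance factor allows. The sketch below is illustrative only: `depth_bound` and `worst_case_depth` are hypothetical helper names, and the worst case is modeled by always keeping the largest permitted fraction $(1-\eta)$ of the items in one child.

```python
import math

def depth_bound(n_leaves, eta):
    """Upper bound on tree depth from Lemma 4: log N / log(1 / (1 - eta))."""
    return math.log(n_leaves) / math.log(1.0 / (1.0 - eta))

def worst_case_depth(n_leaves, eta):
    """Depth reached when every split keeps a (1 - eta) fraction in one child.

    Assumes 0 < eta <= 1/2, so the largest child always shrinks.
    """
    depth, size = 0, n_leaves
    while size > 1:
        size = math.floor((1.0 - eta) * size)  # largest child after one split
        depth += 1
    return depth

# Hypothetical example: N = 512 items and two balance factors.
balanced = worst_case_depth(512, 0.5)
skewed = worst_case_depth(512, 0.25)
```

A perfectly balanced tree ($\eta = 1/2$) meets the bound with depth $\log_2 N$, while smaller $\eta$ permits deeper trees and loosens the bound accordingly.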

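The entire hierarchical clustering can be resolved if all the clusters are
resolved correctly. With a maximum depth of L, the total number of clusters
M in the hierarchy is bounded by $\sum_{\ell=0}^{L} 2^{\ell} \le 2^{L+1} \le 2N^{1/\log\left(\frac{1}{1-\eta}\right)}$, using the
result of Lemma 4. Therefore, the probability that some cluster in the hierarchy
is not resolved is at most $M\delta' \le 2N^{1/\log\left(\frac{1}{1-\eta}\right)} \delta'$ (where split succeeds with probability
$> 1 - \delta'$). Therefore, all clusters (which satisfy the conditions A1 and A2
of Theorem 4.1 and have size > 2m) can be resolved with probability $\ge 1 - \delta$ (by
setting $\delta' = \delta / \left( 2N^{1/\log\left(\frac{1}{1-\eta}\right)} \right)$). From the proof of Theorem 4.1 we require,

$$ m \ge \max\left\{ \frac{\log\left( \frac{8}{\delta} N^{1 + \left(1/\log\left(\frac{1}{1-\eta}\right)\right)} \right)}{2 \left( \left( 1 - \frac{\delta}{2N^{1 + \left(1/\log\left(\frac{1}{1-\eta}\right)\right)}} \right) (1-q)^2 - \frac{1}{2} \right)^2},\; \frac{\log\left( \frac{8}{\delta} N^{1 + \left(1/\log\left(\frac{1}{1-\eta}\right)\right)} \right)}{2 \min\left( (\gamma - 1 + (1-q)^2)^2,\; ((1-q)^2 \eta - \gamma)^2 \right)} \right\} $$

which holds if $m = k_0(\delta, \eta, q, \gamma) \log\left( \frac{8}{\delta} N \right)$ where,

$$ k_0(\delta, \eta, q, \gamma) \ge c_0(\delta, \eta, q, \gamma) \left( 1 + \left( 1/\log\left(\frac{1}{1-\eta}\right) \right) \right) \qquad (4) $$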

Given this choice of m, we find that the RAcluster methodology in
Algorithm 3 for a set of N items will resolve all clusters that satisfy A1 and A2 of
Theorem 4.1 and have size > 2m, with probability at least $1 - \delta$.

Furthermore, the algorithm only requires $O(N \log^2 N)$ total pairwise
similarities. By running the RAcluster methodology, using the result from Lemma 4,
each item will have the split methodology performed at most $\log N / \log\left(\frac{1}{1-\eta}\right)$
times (i.e., once for each depth level of the hierarchy). If $m = k_0(\delta, \eta, q, \gamma) \log\left( \frac{8}{\delta} N \right)$
for RAcluster, each call to split will require only $3 k_0(\delta, \eta, q, \gamma) \log\left( \frac{8}{\delta} N \right)$
pairwise similarities per item. Given N total items, we find that the RAcluster
methodology requires at most $3 k_0(\delta, \eta, q, \gamma)\, N\, \frac{\log N}{\log\left(\frac{1}{1-\eta}\right)} \log\left( \frac{8}{\delta} N \right)$ pairwise
similarities.
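
To see how this bound compares with the complete set of N(N-1)/2 similarities, one can evaluate both as N grows. The sketch below is a rough illustration under stated assumptions: the function names are hypothetical, `k0 = 1.0` is a placeholder rather than a value derived from (4), and the natural logarithm is assumed.

```python
import math

def racluster_similarity_budget(n_items, eta, k0, delta):
    """Upper bound 3 * k0 * N * (log N / log(1/(1-eta))) * log(8N/delta)."""
    depth = math.log(n_items) / math.log(1.0 / (1.0 - eta))
    per_item_per_split = 3.0 * k0 * math.log(8.0 * n_items / delta)
    return n_items * depth * per_item_per_split

def complete_set_size(n_items):
    """Number of pairwise similarities in the complete set, N(N-1)/2."""
    return n_items * (n_items - 1) // 2

def budget_fraction(n_items, eta=0.5, k0=1.0, delta=0.05):
    """Bound expressed as a fraction of the complete set (k0 is a placeholder)."""
    return racluster_similarity_budget(n_items, eta, k0, delta) / complete_set_size(n_items)

fractions = [budget_fraction(2 ** p) for p in (10, 14, 18, 20)]
```

Because the bound grows as $N \log^2 N$ while the complete set grows as $N^2$, the fraction shrinks as N increases. For small N or a large constant $k_0$ the bound can exceed N(N-1)/2, so the savings it describes are asymptotic.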